You can define bullet list or numbered list:
Here you can define formula
\[ y = \beta_0 + \beta_1X + \epsilon \] the formula can be reported in the text: \(\mu = 1/n \sum X_i\)
data_SD3 <- read.delim("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D1_SD3/data_SD3.csv", stringsAsFactors=TRUE)
Forbes2000 <- read.csv("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D0_Data/Forbes2000.csv", stringsAsFactors=TRUE)
glass <- read.csv("~/RProjects/2024-Q2-R-2 [MDA2024, exercises]/D0_Data/glass.csv", stringsAsFactors=TRUE)
To add R code in the Notebook we need to use the Chunk.
X <- iris
It is possible to have an overview of the data by using the summary function.
summary(X)
Sepal.Length Sepal.Width Petal.Length
Min. :4.300 Min. :2.000 Min. :1.000
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
Median :5.800 Median :3.000 Median :4.350
Mean :5.843 Mean :3.057 Mean :3.758
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
Max. :7.900 Max. :4.400 Max. :6.900
Petal.Width Species
Min. :0.100 setosa :50
1st Qu.:0.300 versicolor:50
Median :1.300 virginica :50
Mean :1.199
3rd Qu.:1.800
Max. :2.500
In R there are three main type of data:
Y <- as.matrix(X[ ,1:4])
To handle data you can use the following code:
X[10, 2] # selection of one element in the Data Frame (or matrix)
[1] 3.1
X[5:20, 1:3] # selection of an interval
X[5:20, ] # the empty space select all the columns or rows
X$Sepal.Length # the symbol $ is used to select a column in the data frame
[1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8
[14] 4.3 5.8 5.7 5.4 5.1 5.7 5.1 5.4 5.1 4.6 5.1 4.8 5.0
[27] 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 4.4
[40] 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 7.0 6.4
[53] 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6
[66] 6.7 5.6 5.8 6.2 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7
[79] 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 5.6 5.5 5.5
[92] 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 6.3 5.8 7.1 6.3
[105] 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5
[118] 7.7 7.7 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2
[131] 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 6.0 6.9 6.7 6.9 5.8
[144] 6.8 6.7 6.7 6.3 6.5 6.2 5.9
boxplot(X$Sepal.Length, main = "Box Plot of the Sepal Length of the IRIS Flowers", col = "blue", horizontal = T)
Error in if (horizontal) plot.window(ylim = xlim, xlim = ylim, log = log, :
the condition has length > 1
The elements of the box-plot are reported below:
# just a boxplot
boxplot(X[ ,1:4])
# boxplot with a title and colors
boxplot(X[ ,1:4], main = "Box Plot of all quantitative variables of IRIS data", col = terrain.colors(4))
boxplot(X$Sepal.Width ~ X$Species, main = "Box Plot of Sepal Width considering the 3 types of flowers", xlab = "Type of IRIS Flowers", ylab = "Sepal Width", col = terrain.colors(4))
The tilde symbol is obtained by:
The bold line is the Median, that is the value of the ordered distribution that leaves the same number of units above and below (or on left and right)
boxplot(X$Sepal.Length, main = "Box-Plot of the Sepal Lenght", col = "green", horizontal = F)
boxplot(X[ ,1:4], main = "Box-Plot with all the Variables", col = "blue", horizontal = F)
boxplot(X$Sepal.Width ~ X$Species, main = "Box-Plot about Sepal Width with different type of IRIS Flowers")
Bar Plot can be used for Qualitative Data and for Categorized Quantitative Data. The first step to create a Bar Plot is to generate a Table of Frequency.
T <- table(X$Species)
T
setosa versicolor virginica
50 50 50
barplot(T, main = "Bar Plot of Type of flowers", xlab = "Type of flowers", ylab = "Absolute Frequency", col = terrain.colors(4))
It is based on the frequency table.
pie(T, main = "Pie Chart", col = terrain.colors(4))
Histogram is a plot used only for Quantitative Data, it is based on a frequency tables in classes. The R function is called hist and the input is a simple distribution of a quantitative variable.
hist(X$Sepal.Length)
hist(X$Sepal.Width, main = "Histogram of Sepal Width", xlab = "Classes", ylab = "Absolute Frequency", col = "lightgreen", border = "blue")
he histogram can be used only for quantitative variables.
hist(X$Sepal.Width, main = "Histogram of the Sepal Width", xlab = "Classes",
ylab = "Absolute Frequency", col = "green", border = "red", breaks = 10)
In case of equally spaced (same size) classes we can report on the Y axis the Absolute Frequency or relative frequency. In case of classes with different sizes we have to report on Y axis the density of frequency. The formula is the following: \(d_i = n_i/h_i\), where \(n_i\) is the absolute frequency and \(h_i\) is the size of the class.
plot(X$Sepal.Length, X$Sepal.Width, main = "Correlation Plot", xlab = "Sepal Length", ylab = "Sepal Width")
plot(X$Petal.Length, X$Petal.Width, main = "Correlation Plot",
xlab = "Petal Lenght", ylab = "Petal Width", col = "blue",
pch = 1)
Plots with IRIS
plot(X$Sepal.Length, X$Sepal.Width, main = "(1-2) Plot with IRIS",
xlab = "Sepal Lenght", ylab = "Sepal Width", col = "blue")
plot(X$Petal.Length, X$Petal.Width, main = "(2-2) Plot with IRIS",
xlab = "Petal Lenght", ylab = "Petal Width", col = "red")
pairs(X[ ,1:4])
The plots below the main diagonal are the same of the plot above the main diagonal. The reason is because the plot and the correlation index are symmetric.
r <- cor(X[ ,1:2])
r <- round(r, 3)
r
Sepal.Length Sepal.Width
Sepal.Length 1.000 -0.118
Sepal.Width -0.118 1.000
cor(X$Sepal.Length, X$Sepal.Width)
[1] -0.1175698
The range of correlation index is: -1 <= r <= 1 The interpretation of the Correlation Index called r is following:
The correlation between Sepal Length and Sepal Width is -0.118 and it is a low negative correlation.
install.packages("ggplot2")
Error in install.packages : Updating loaded packages
library(ggplot2)
GGPLOTS has 3 main arguments:
# GGPLOT (example no 1)
ggplot(data = X[ ,1:2])
# GGPLOT (example no 2.1)
ggplot(data = X[ ,1:2]) +
geom_point(mapping = aes(X$Sepal.Length, X$Sepal.Width))
# GGPLOT (example no 2.2)
ggplot(data = X[ ,1:2]) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width))
# GGPLOT (example no 2.2)
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width))
# GGPLOT (example no 3)
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width)) +
ggtitle("Scatter Plot") + xlab("Sepal Length") + ylab("Sepal Width")
ggplot(data = X) +
geom_point(mapping = aes(Sepal.Length, Sepal.Width, color = Species))
STANDARD TEMPLATE IS: ggplot(data = ) +
# First Example
ggplot(data = X) +
geom_point(mapping = aes(Petal.Length, Petal.Width, color = Species)) +
ggtitle("Petal Lenght and Width") +
xlab("Petal Lenght") + ylab("Petal Width")
# Second Example
ggplot(data = X, mapping = aes(Petal.Width, Petal.Length)) +
geom_point(mapping = aes(color = Species)) +
ggtitle("Petal Width and Lenght") +
xlab("Petal Width") + ylab("Petal Lenght")
ggplot(data = X) +
geom_boxplot(mapping = aes(Sepal.Width), color = "blue", outlier.colour = "red", outlier.shape = 8, outlier.size = 3) +
ggtitle("Box Plot for Sepal Lenght")
Box Plot taking into account the 3 types of flowers
ggplot(data = X) +
geom_boxplot(mapping = aes(Species, Sepal.Width), outlier.color = "red", outlier.shape = 8)
GGPLOT function as object
p <- ggplot(data = X) +
geom_boxplot(mapping = aes(Species, Sepal.Width, fill = Species))
p
p + theme(legend.position = "bottom")
ggplot(data = X) +
geom_bar(mapping = aes(Species)) +
ggtitle("Bar Plot with GGPLOT") +
ylab("Absolute Frequency")
A data frame with 234 rows and 11 variables:
Y <- mpg
summary(Y)
manufacturer model displ
Length:234 Length:234 Min. :1.600
Class :character Class :character 1st Qu.:2.400
Mode :character Mode :character Median :3.300
Mean :3.472
3rd Qu.:4.600
Max. :7.000
year cyl trans
Min. :1999 Min. :4.000 Length:234
1st Qu.:1999 1st Qu.:4.000 Class :character
Median :2004 Median :6.000 Mode :character
Mean :2004 Mean :5.889
3rd Qu.:2008 3rd Qu.:8.000
Max. :2008 Max. :8.000
drv cty hwy
Length:234 Min. : 9.00 Min. :12.00
Class :character 1st Qu.:14.00 1st Qu.:18.00
Mode :character Median :17.00 Median :24.00
Mean :16.86 Mean :23.44
3rd Qu.:19.00 3rd Qu.:27.00
Max. :35.00 Max. :44.00
fl class
Length:234 Length:234
Class :character Class :character
Mode :character Mode :character
head(Y)
table(Y$cyl)
4 5 6 8
81 4 79 70
# first example: factor(cyl) as a color ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = factor(cyl)))
# second example: class as a color ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class))
# third example: class instead cyl ( vertical bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class))
# fourth example: class instead cyl ( horizontal bar chart )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class)) + coord_flip()
# fifth example: class instead cyl ( horizontal bar chart & legend at the bottom )
ggplot(data = Y) +
geom_bar(mapping = aes(cyl, fill = class)) + coord_flip() + theme(legend.position = "bottom")
# example no 1.1
ggplot(data = Y) +
geom_histogram(mapping = aes(cty))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
# example no 1.2
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, colour = class))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
# example no 1.3
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, fill = class))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
# example no 1.4
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, fill = "red"))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
# example no 1.5
ggplot(data = Y) +
geom_histogram(mapping = aes(cty, fill = factor(cty)))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
# example 2
ggplot(data = Y) +
geom_histogram(mapping = aes(hwy))
`stat_bin()` using `bins = 30`. Pick better value with
`binwidth`.
With Facet Wrap layer (option) we can create sub-plots based on a categorical variable.
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
facet_wrap(~drv)
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
facet_wrap(drv ~ class)
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'
# First Example
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, linetype = class))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
span too small. fewer data values than degrees of freedom.
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
pseudoinverse used at 5.6935
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
neighborhood radius 0.5065
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
reciprocal condition number 0
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
There are other near singularities as well. 0.65044
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
span too small. fewer data values than degrees of freedom.
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
pseudoinverse used at 5.6935
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
neighborhood radius 0.5065
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
reciprocal condition number 0
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
There are other near singularities as well. 0.65044
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
pseudoinverse used at 4.008
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
neighborhood radius 0.708
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
reciprocal condition number 1.6135e-17
Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric, :
There are other near singularities as well. 0.25
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
pseudoinverse used at 4.008
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
neighborhood radius 0.708
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
reciprocal condition number 1.6135e-17
Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x else if (is.data.frame(newdata)) as.matrix(model.frame(delete.response(terms(object)), :
There are other near singularities as well. 0.25
# Second Example
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, linetype = drv))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy)) +
facet_wrap(~ drv)
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'
ggplot(data = Y) +
geom_smooth(mapping = aes(displ, hwy, group = drv))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy, color = drv)) +
geom_smooth(mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using method = 'loess' and formula = 'y
~ x'
In regression model we need to define the dependent and independent variables. In our case the model is define as follow:
In the first place we need to create a scatter plot (correlation plot).
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy)) +
geom_smooth(method = lm, mapping = aes(displ, hwy))
`geom_smooth()` using formula = 'y ~ x'
ggplot(data = Y) +
geom_point(mapping = aes(displ, hwy, color = drv)) +
geom_smooth(method = lm, mapping = aes(displ, hwy, color = drv))
`geom_smooth()` using formula = 'y ~ x'
res.reg <- lm(hwy ~ displ, data = Y)
summary(res.reg)
Call:
lm(formula = hwy ~ displ, data = Y)
Residuals:
Min 1Q Median 3Q Max
-7.1039 -2.1646 -0.2242 2.0589 15.0105
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 35.6977 0.7204 49.55 <2e-16 ***
displ -3.5306 0.1945 -18.15 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.836 on 232 degrees of freedom
Multiple R-squared: 0.5868, Adjusted R-squared: 0.585
F-statistic: 329.5 on 1 and 232 DF, p-value: < 2.2e-16
The regression coefficients are defined as follow:
The results of the regression model showed that the estimated \(\beta_0\) (the intercept) is equal to 35.6977 and the estimated \(\beta_1\) (the regression coefficient/slope) is equal to -3.3506.
The quality of the model is measured by the godness of fix index \(R^2\) that is for this case study equal to 0.585. The \(R^2\) is interpreted as follow:
In this case study we have a Medium-High Quality of the model.
To test the regression parameters it is possible to use the t-value or the p-value, as reported below:
In this case study the estimated regression coefficient \(\beta_1\) is statistically significant, since the t-value is -18.15 (< 2) and the p-value is 2e-16 (< 5%). It means that the number of mile per gallon (hwy) depends by the power of the engine (displ), so by increasing the displ of 1 unit, the hwy will decrease by -3.5306.
The Principal Component Analysis has the following aims and characteristics:
library(FactoMineR)
library(ggplot2)
library(factoextra)
For the PCA we will use the PCA function in the Package FactoMineR.
Y <- mpg
res.pca <- PCA(Y[,c(3,5,8,9)])
summary(res.pca)
Call:
PCA(X = Y[, c(3, 5, 8, 9)])
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4
Variance 3.509 0.379 0.071 0.041
% of var. 87.736 9.477 1.772 1.014
Cumulative % of var. 87.736 97.214 98.986 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2 Dim.2 ctr
1 | 2.002 | 1.832 0.408 0.837 | -0.600 0.406
2 | 2.211 | 2.190 0.584 0.981 | -0.289 0.094
3 | 2.202 | 2.160 0.568 0.963 | -0.129 0.019
4 | 2.203 | 2.196 0.587 0.994 | -0.118 0.016
5 | 0.709 | 0.336 0.014 0.225 | -0.077 0.007
6 | 0.731 | 0.575 0.040 0.619 | 0.130 0.019
7 | 0.720 | 0.543 0.036 0.568 | 0.340 0.130
8 | 1.822 | 1.581 0.304 0.753 | -0.879 0.872
9 | 1.781 | 1.258 0.193 0.499 | -1.180 1.569
10 | 1.954 | 1.910 0.444 0.955 | -0.408 0.188
cos2 Dim.3 ctr cos2
1 0.090 | -0.187 0.212 0.009 |
2 0.017 | -0.087 0.046 0.002 |
3 0.003 | -0.056 0.019 0.001 |
4 0.003 | 0.000 0.000 0.000 |
5 0.012 | -0.498 1.493 0.493 |
6 0.032 | -0.431 1.119 0.348 |
7 0.222 | -0.290 0.506 0.162 |
8 0.233 | -0.122 0.089 0.004 |
9 0.439 | -0.167 0.167 0.009 |
10 0.044 | 0.010 0.001 0.000 |
Variables
Dim.1 ctr cos2 Dim.2 ctr cos2
displ | -0.932 24.773 0.869 | 0.309 25.155 0.095 |
cyl | -0.933 24.821 0.871 | 0.307 24.860 0.094 |
cty | 0.951 25.754 0.904 | 0.271 19.334 0.073 |
hwy | 0.930 24.652 0.865 | 0.341 30.652 0.116 |
Dim.3 ctr cos2
displ 0.187 49.213 0.035 |
cyl -0.183 47.082 0.033 |
cty 0.038 2.010 0.001 |
hwy -0.035 1.694 0.001 |
Z <- mtcars
res.pca1 <- PCA(Z)
Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider increasing max.overlaps
summary(res.pca1)
Call:
PCA(X = Z)
Eigenvalues
Dim.1 Dim.2 Dim.3 Dim.4
Variance 6.608 2.650 0.627 0.270
% of var. 60.076 24.095 5.702 2.451
Cumulative % of var. 60.076 84.172 89.873 92.324
Dim.5 Dim.6 Dim.7 Dim.8
Variance 0.223 0.212 0.135 0.123
% of var. 2.031 1.924 1.230 1.117
Cumulative % of var. 94.356 96.279 97.509 98.626
Dim.9 Dim.10 Dim.11
Variance 0.077 0.052 0.022
% of var. 0.700 0.473 0.200
Cumulative % of var. 99.327 99.800 100.000
Individuals (the 10 first)
Dist Dim.1 ctr cos2
Mazda RX4 | 2.234 | -0.657 0.204 0.087 |
Mazda RX4 Wag | 2.081 | -0.629 0.187 0.091 |
Datsun 710 | 2.987 | -2.779 3.653 0.866 |
Hornet 4 Drive | 2.521 | -0.312 0.046 0.015 |
Hornet Sportabout | 2.456 | 1.974 1.844 0.646 |
Valiant | 3.014 | -0.056 0.001 0.000 |
Duster 360 | 3.187 | 3.003 4.264 0.888 |
Merc 240D | 2.841 | -2.055 1.998 0.523 |
Merc 230 | 3.733 | -2.287 2.474 0.375 |
Merc 280 | 1.907 | -0.526 0.131 0.076 |
Dim.2 ctr cos2 Dim.3 ctr
Mazda RX4 1.735 3.551 0.604 | -0.601 1.801
Mazda RX4 Wag 1.550 2.833 0.555 | -0.382 0.728
Datsun 710 -0.146 0.025 0.002 | -0.241 0.290
Hornet 4 Drive -2.363 6.584 0.879 | -0.136 0.092
Hornet Sportabout -0.754 0.671 0.094 | -1.134 6.412
Valiant -2.786 9.151 0.855 | 0.164 0.134
Duster 360 0.335 0.132 0.011 | -0.363 0.656
Merc 240D -1.465 2.531 0.266 | 0.944 4.439
Merc 230 -1.984 4.639 0.282 | 1.797 16.094
Merc 280 -0.162 0.031 0.007 | 1.493 11.103
cos2
Mazda RX4 0.072 |
Mazda RX4 Wag 0.034 |
Datsun 710 0.007 |
Hornet 4 Drive 0.003 |
Hornet Sportabout 0.213 |
Valiant 0.003 |
Duster 360 0.013 |
Merc 240D 0.110 |
Merc 230 0.232 |
Merc 280 0.613 |
Variables (the 10 first)
Dim.1 ctr cos2 Dim.2 ctr
mpg | -0.932 13.143 0.869 | 0.026 0.026
cyl | 0.961 13.981 0.924 | 0.071 0.191
disp | 0.946 13.556 0.896 | -0.080 0.243
hp | 0.848 10.894 0.720 | 0.405 6.189
drat | -0.756 8.653 0.572 | 0.447 7.546
wt | 0.890 11.979 0.792 | -0.233 2.046
qsec | -0.515 4.018 0.266 | -0.754 21.472
vs | -0.788 9.395 0.621 | -0.377 5.366
am | -0.604 5.520 0.365 | 0.699 18.440
gear | -0.532 4.281 0.283 | 0.753 21.377
cos2 Dim.3 ctr cos2
mpg 0.001 | -0.179 5.096 0.032 |
cyl 0.005 | -0.139 3.073 0.019 |
disp 0.006 | -0.049 0.378 0.002 |
hp 0.164 | 0.111 1.960 0.012 |
drat 0.200 | 0.128 2.598 0.016 |
wt 0.054 | 0.271 11.684 0.073 |
qsec 0.569 | 0.319 16.255 0.102 |
vs 0.142 | 0.340 18.388 0.115 |
am 0.489 | -0.163 4.234 0.027 |
gear 0.567 | 0.229 8.397 0.053 |
The eigenvalues measure the level of information retained by the PCA. They show the explained variance from the data and help us to select the number of Latent Variables (or Latent Traits, Principal Components, Dimensions). The rules to select the number of LVs are: - Eigenvalues greater than 1. \(\lambda > 1\). - Cumulative percentage of explained variance > 0.7 (70%)
In our case study we select two components. Both components have eigenvalue greater than 1 and with and Cumulative percentage of explained variance equal to 84.172%.
round(res.pca1$eig,3)
eigenvalue percentage of variance
comp 1 6.608 60.076
comp 2 2.650 24.095
comp 3 0.627 5.702
comp 4 0.270 2.451
comp 5 0.223 2.031
comp 6 0.212 1.924
comp 7 0.135 1.230
comp 8 0.123 1.117
comp 9 0.077 0.700
comp 10 0.052 0.473
comp 11 0.022 0.200
cumulative percentage of variance
comp 1 60.076
comp 2 84.172
comp 3 89.873
comp 4 92.324
comp 5 94.356
comp 6 96.279
comp 7 97.509
comp 8 98.626
comp 9 99.327
comp 10 99.800
comp 11 100.000
Below we can read the table with correlations between the Manifest Variables (MVs) and the Latent Variables (Dimensions). The correlations are also the coordinates of the manifest variables on the variables plot.
For the interpretation:
In our case study, the LVs (dimensions) are so defined:
round(res.pca1$var$coord[,1:2],3)
Dim.1 Dim.2
mpg -0.932 0.026
cyl 0.961 0.071
disp 0.946 -0.080
hp 0.848 0.405
drat -0.756 0.447
wt 0.890 -0.233
qsec -0.515 -0.754
vs -0.788 -0.377
am -0.604 0.699
gear -0.532 0.753
carb 0.550 0.673
it is possible to evaluate the correlations between the manifest variables by using the angle between the vectors. For instance, the variables hp, cyl, disp and wt are all positively correlated since all the angles are less then 90°.
plot.PCA(res.pca1, axes = c(1,2), choix = "var")
The quality of the projected manifest variables is given by the Cosine squared.
Below you can find the interpretation intervals:
round(res.pca1$var$cos2[,1:2],3)
Dim.1 Dim.2
mpg 0.869 0.001
cyl 0.924 0.005
disp 0.896 0.006
hp 0.720 0.164
drat 0.572 0.200
wt 0.792 0.054
qsec 0.266 0.569
vs 0.621 0.142
am 0.365 0.489
gear 0.283 0.567
carb 0.303 0.453
In the units plot we can evaluate the clusters of units. The clusters define group of units that are similar on the basis of the Latent Variables estimated.
plot.PCA(res.pca1, axes = c(1,2), choix = "ind")
Warning: ggrepel: 17 unlabeled data points (too many overlaps). Consider increasing max.overlaps
library(factoextra)
fviz_pca_var(res.pca1)
fviz_pca_ind(res.pca1)
Biplot is a plot where both units and variables are represented.
fviz_pca_biplot(res.pca1)
fviz_screeplot(res.pca1)
fviz_pca_contrib(res.pca1, choice = "var")
Warning in fviz_pca_contrib(res.pca1, choice = "var") :
The function fviz_pca_contrib() is deprecated. Please use the function fviz_contrib() which can handle outputs of PCA, CA and MCA functions.